Causal models for qualitative and mixed methods inference
Mixed methods
Macartan Humphreys and Alan Jacobs
2 Population parameters
2.1 Population-level causal questions
Average causal effects for a population
e.g., What is the average effect of \(X\) on \(Y\)?
Proportion of different effects in a population
What share of cases in the population have positive effects?
What share have negative effects?
Causal pathways
e.g., How commonly does \(X\) affect \(Y\) through \(M\) (vs. \(N\)) in the population?
2.2 Causal queries on a DAG: population-level average causal effect
What is the average effect of \(X\) on \(Y\) in the population?
This is a question about the values in \(\lambda^Y\)
2.3 Causal queries on a DAG: population-level average causal effect
With binary variables, the average effect is always the difference between
The share of cases with a positive effect
The share of cases with a negative effect
\(\lambda^Y_{01} - \lambda^Y_{10}\)
2.4 Causal queries on a DAG: population-level average causal effect
\(ATE = \lambda^Y_{01} - \lambda^Y_{10}\)
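A quick numeric check, with hypothetical type shares chosen only for illustration:

```r
# hypothetical type shares for Y: (Y00, Y10, Y01, Y11) = shares of
# always-0, negative-effect, positive-effect, and always-1 cases
lambda_Y <- c(Y00 = 0.2, Y10 = 0.1, Y01 = 0.5, Y11 = 0.2)

# ATE = share of positive effects minus share of negative effects
ATE <- unname(lambda_Y["Y01"] - lambda_Y["Y10"])
ATE  # 0.4
```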
2.5 ATE with a mediator
We need to
Combine two ways of generating positive effects
Combine two ways of generating negative effects
Subtract the second from the first
2.6 ATE with a mediator
Two ways of generating positive effects: \(\lambda^M_{01} \times \lambda^Y_{01} + \lambda^M_{10} \times \lambda^Y_{10}\)
Two ways of generating negative effects: \(\lambda^M_{10} \times \lambda^Y_{01} + \lambda^M_{01} \times \lambda^Y_{10}\)
Subtract the second from the first
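The same arithmetic in R, with hypothetical shares for both links (values for illustration only):

```r
# hypothetical shares for the X -> M -> Y chain, each (00, 10, 01, 11)
lambda_M <- c(M00 = 0.2, M10 = 0.1, M01 = 0.6, M11 = 0.1)  # X -> M types
lambda_Y <- c(Y00 = 0.1, Y10 = 0.2, Y01 = 0.5, Y11 = 0.2)  # M -> Y types

# positive X -> Y effects: two positive links, or two negative links
positive <- lambda_M["M01"] * lambda_Y["Y01"] + lambda_M["M10"] * lambda_Y["Y10"]
# negative X -> Y effects: one positive link and one negative link
negative <- lambda_M["M10"] * lambda_Y["Y01"] + lambda_M["M01"] * lambda_Y["Y10"]

ATE <- unname(positive - negative)
ATE  # (0.30 + 0.02) - (0.05 + 0.12) = 0.15
```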
2.7 Causal queries on a DAG: how often does \(X\) have a positive effect on \(Y\)?
Proportion of cases with positive effect = \(\lambda^Y_{01}\)
2.8 Causal queries on a DAG: how often does \(X\) have a positive effect on \(Y\)?
Proportion of cases with positive effect = \(\lambda^M_{01} \times \lambda^Y_{01} + \lambda^M_{10} \times \lambda^Y_{10}\)
2.9 Causal queries on a DAG: pathway questions
We can also pose the pathway query at population level
What is the share of cases for which \(X\) has a positive effect on \(Y\) through \(M\)?
A question about joint \(\lambda^M\) and \(\lambda^Y\) distributions
2.10 How will we answer questions about populations?
We need to learn about those \(\lambda\)’s
About the proportions of the population with different kinds of causal effects
We will have a prior belief about those proportions
When we see data on lots of cases, we will update those beliefs about proportions
From a prior distribution over \(\lambda\) to a posterior distribution over \(\lambda\)
3 Intuition
3.1 How do we “update” our models?
We’ve talked about process tracing a single case to answer a case-level query
Here the model is fixed
We use the model + case data to answer questions about the case
We can also use data to “update” our models
Use data on many cases to learn about causal effects in the population
Allows mixing methods: using data on lots of cases, we can learn about the probative value of process-tracing evidence
The core logic: we learn by updating population-level causal beliefs toward beliefs more consistent with the data
3.2 Start with a DAG
3.3 Large-\(N\) estimation of \(ATE\): what happens to beliefs over parameters
We collect data on \(I\), \(M\), and \(D\) for a large number of cases
We observe a strong positive correlation
We will think there’s a positive average effect
3.4 Large-\(N\) estimation of \(ATE\): what happens to beliefs over parameters
Now, we will update on \(\lambda^I\), \(\lambda^M\), and \(\lambda^D\)
An \(I \rightarrow D\) effect can only happen if \(I\) affects \(M\) and \(M\) affects \(D\), in specific ways
Two possible combinations of effects can generate a positive \(I \rightarrow D\) effect
\(I \rightarrow M\) is positive, \(M \rightarrow D\) is positive
\(I \rightarrow M\) is negative, \(M \rightarrow D\) is negative
So we will come to put more weight on a joint distribution of \(\lambda^M\) and \(\lambda^D\) in which there are lots of cases with one of these two combinations
…and less posterior weight on all other combinations of effects
3.5 General procedure
Key insight:
If we suppose a given set of parameter values we can figure out the likelihood of the data given those values.
We can do this for all possible parameter values and see which ones are more in line with the data
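One way to construct such a grid of candidate parameter values is to draw the four type shares (here labeled a, b, c, d, following the code below) uniformly from the simplex. A sketch, with an arbitrary number of draws:

```r
# draw candidate parameter vectors (a, b, c, d), each row summing to 1:
# normalized exponential draws are a uniform (Dirichlet(1,1,1,1)) sample
# from the simplex
set.seed(1)
n <- 10000
draws <- matrix(rexp(4 * n), ncol = 4)
draws <- draws / rowSums(draws)
x <- setNames(as.data.frame(draws), c("a", "b", "c", "d"))

# every candidate row is a valid set of shares
all(abs(rowSums(x) - 1) < 1e-12)  # TRUE
```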
# add likelihood and calculate posterior
x <- x |>
  rowwise() |> # ensures row-wise operations
  mutate(
    likelihood = dmultinom(
      c(400, 100, 100, 400),
      prob = c(b + c, a + c, a + d, b + d) / 2
    )
  ) |>
  ungroup() |>
  mutate(posterior = likelihood / sum(likelihood))
3.6.6 Causal inference on a grid: execution
x |>
  mutate(
    likelihood = formatC(likelihood, format = "e", digits = 2),
    posterior = formatC(posterior, format = "e", digits = 2)
  ) |>
  head() |>
  kable(digits = 2)
a     b     c     d     likelihood   posterior
0.30  0.17  0.53  0.00  2.10e-212    1.72e-209
0.50  0.21  0.29  0.00  1.26e-221    1.03e-218
0.11  0.38  0.13  0.39  7.80e-46     6.38e-43
0.63  0.02  0.14  0.20  0.00e+00     0.00e+00
0.48  0.09  0.18  0.24  1.97e-231    1.61e-228
0.27  0.07  0.52  0.15  3.99e-194    3.26e-191
3.6.7 Causal inference on a grid: inferences
x |>
  summarize(
    a = weighted.mean(a, posterior),
    b = weighted.mean(b, posterior),
    ATE = b - a
  ) |>
  kable(digits = 2)
a    b     ATE
0.1  0.69  0.59
3.6.8 Causal inference on a grid: inferences
x |>
  ggplot(aes(b, a, size = posterior)) +
  geom_point(alpha = .5)
Spot the ridge
3.7 In sum: learning from data
For any data pattern, we gain confidence in parameter values more consistent with the data
For single-case inference, we must bring background beliefs about population-level causal effects
For multiple cases, we can learn about effects from the data
Large-\(N\) data can thus provide probative value for small-\(N\) process-tracing
All inference is conditional on the model
4 Mixed methods
4.1 A DAG
We’ll want to learn about the \(\theta\)’s and the \(\lambda\)’s
We need to observe nodes to learn about other nodes
We can potentially observe 3 nodes here: \(X, M\), and \(Y\)
4.2 A typical “quantitative” data structure
Data on exogenous variables and a key outcome for many cases
E.g., data on inequality (\(I\)) and democracy (\(D\)) for many cases
4.3 A typical “qualitative” data structure
Data on exogenous variables and a key outcome plus elements of process for a small number of cases
Finite resources mean tradeoffs between extensive and intensive data collection
E.g., data on inequality (\(I\)), mass mobilization (\(M\)), and democracy (\(D\)) for a small number of cases
4.4 Mixing qualitative and quantitative
What if we combine extensive data on many cases with intensive data on a few cases?
A non-rectangular data structure
4.5 Non-rectangular data
A data structure that neither standard quantitative nor standard qualitative approaches can handle in a systematic way
Not a problem for the Integrated Inferences approach
We simply ask:
Which causal effects in the population are most and least consistent with the data pattern we observe?
That is, what distribution of causal effects in the population, for each node, is most consistent with this data pattern?
CausalQueries uses information wherever it finds it
4.6 Mixing in practice
For Bayesian approaches this mixing is not hard.
Critically, though, we maintain the assumption that cases for "in depth" analysis are chosen at random; otherwise we have to account for selection processes.
What is the probability of seeing these two cases?
Say we just observe a positive Inequality-Democratization correlation
Could be because Inequality causes Democratization
Could be because of confounding
4.9.2 How qual can inform quant: confounding
Remember
Observing \(M\) helps
Process data helps address the deep problem of confounding
Key point: we don’t need \(M\) for all cases
Can learn from \(I\) and \(D\) for lots of cases and \(M\) for a subset
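Why partial \(M\) data still helps: a case with \(M\) unobserved still contributes to the likelihood, because we can marginalize over the possible values of \(M\). A sketch with hypothetical type shares for the \(I \rightarrow M \rightarrow D\) chain (values illustrative, not from the slides):

```r
# hypothetical type shares: lam_M are I -> M types, lam_D are M -> D types,
# each ordered (00, 10, 01, 11)
lam_M <- c(M00 = 0.2, M10 = 0.1, M01 = 0.5, M11 = 0.2)
lam_D <- c(D00 = 0.2, D10 = 0.1, D01 = 0.5, D11 = 0.2)

# probability that D = 1 in an I = 1 case when M is NOT observed:
# sum over M = 1 and M = 0
p_M1    <- lam_M["M01"] + lam_M["M11"]   # P(M = 1 | I = 1)
p_D1_M1 <- lam_D["D01"] + lam_D["D11"]   # P(D = 1 | M = 1)
p_D1_M0 <- lam_D["D10"] + lam_D["D11"]   # P(D = 1 | M = 0)

p_D1 <- unname(p_M1 * p_D1_M1 + (1 - p_M1) * p_D1_M0)
p_D1  # 0.7 * 0.7 + 0.3 * 0.3 = 0.58
```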
4.9.3 How qual can inform quant: observable confounder
Another example: \(M\) as the confounder
4.9.4 How qual can inform quant: observable confounder
How much can we learn from \(M\) data for some cases?
4.9.5 How quant can inform qual: getting probative value of a clue from the data
Suppose we go to the field and we learn that mass mobilization DID occur in Malawi
So \(M=1\)
What can we conclude?
NOTHING YET!
4.9.6 How quant can inform qual: getting probative value of a clue from the data
The pure process-tracing solution: assign our beliefs about causal effects in the population
E.g., beliefs that linked positive effects are more likely than linked negative effects
Meaning that \(M=1\) in an \(I=1, D=1\) case speaks in favor of \(I=1\) causing \(D=1\)
The mixed-methods solution: learn about population-level effects from large-\(N\) data
4.9.7 How quant can inform qual: getting probative value of a clue from the data
Suppose we have data on \(I\), \(D\), and \(M\) for a large number of cases
Suppose we observe a strong positive correlation across all 3 variables
What have we learned, under this model?
Positive \(I \rightarrow M\) effects more likely than negative
Positive \(M \rightarrow D\) effects more likely than negative
So linked positive effects more common than linked negative effects
Meaning that \(M=1\) in an \(I=1, D=1\) case speaks in favor of \(I=1\) causing \(D=1\)
But now we’ve drawn our population-level beliefs from the data
Now, we can go and process-trace
Did high inequality cause democratization in Malawi?
Observe \(M\)
With conclusions grounded in case-level AND population-level evidence
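Assuming independent type shares in the chain model, the probative value of the clue can be computed directly. A sketch, with hypothetical shares of the kind the large-\(N\) data might support, of the probability that \(I\) caused \(D\) in an \(I=1, D=1\) case, with and without observing \(M=1\):

```r
# hypothetical shares in which linked positive effects are common
lam_M <- c(M00 = 0.1, M10 = 0.1, M01 = 0.6, M11 = 0.2)  # I -> M types
lam_D <- c(D00 = 0.1, D10 = 0.1, D01 = 0.6, D11 = 0.2)  # M -> D types

# among I = 1, D = 1, M = 1 cases: the M-type must be M01 or M11 and the
# D-type must be D01 or D11; I caused D only for the (M01, D01) combination
p_pos_M1 <- unname(
  (lam_M["M01"] * lam_D["D01"]) /
    ((lam_M["M01"] + lam_M["M11"]) * (lam_D["D01"] + lam_D["D11"]))
)

# among I = 1, D = 1, M = 0 cases: M-type M00 or M10, D-type D10 or D11;
# no combination yields a positive I -> D effect
p_pos_M0 <- 0

round(p_pos_M1, 2)  # 0.56: observing M = 1 is strong evidence for causation
```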
5 Mixed methods in CausalQueries
5.1 Big picture
CausalQueries brings these elements together by allowing users to:
Make model: Specify a DAG: CausalQueries figures out all principal strata and places a prior on these
Update model: Provide data to the DAG: CausalQueries writes a stan model and updates on all parameters
Query model: CausalQueries figures out which parameters correspond to a given causal query
5.2 Illustration \(X \rightarrow Y\) model
Consider this problem:
        Y = 0        Y = 1
X = 0   \(n_{00}\)   \(n_{01}\)
X = 1   \(n_{10}\)   \(n_{11}\)
where \(X\) is randomized and both \(X\) and \(Y\) are binary
5.3 Model, update, query
library(CausalQueries)
library(fabricatr)

data <- fabricate(
  N = 1000,
  X = rbinom(N, 1, prob = .5),
  Y = rbinom(N, 1, prob = .2 + .4 * X)
)

model <- make_model("X -> Y") |>
  update_model(data)
5.4 Model, update, query
model |> inspect("posterior_distribution")
posterior_distribution
Summary statistics of model parameters posterior distributions:
Distributions matrix dimensions are
4000 rows (draws) by 6 cols (parameters)
mean sd
X.0 0.48 0.02
X.1 0.52 0.02
Y.00 0.28 0.07
Y.10 0.12 0.07
Y.01 0.50 0.07
Y.11 0.11 0.07
5.5 Model, update, query
model |>
  grab("posterior_distribution") |>
  ggplot(aes(Y.01, Y.10)) +
  geom_point(alpha = .2)
Posterior draws
5.6 Model, update, query
model |>
  query_model(
    query = c(
      ATE = "Y[X=1] - Y[X=0]",
      POS = "Y[X=1] > Y[X=0]",
      SOME = "Y[X=1] != Y[X=0]"
    ),
    using = c("priors", "posteriors")
  ) |>
  plot()
5.7 Generalization: Procedure
The CausalQueries approach generalizes to settings in which nodes are categorical:
Identify all principal strata: that is, the universe of possible response types or “causal types”: \(\theta\)
Define as parameters of interest the probability of each of these response types: \(\lambda\)
Place a prior over \(\lambda\): e.g. Dirichlet
Figure out \(\Pr(\text{Data} | \lambda)\)
Use stan to figure out \(\Pr(\lambda | \text{Data})\)
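For the simple \(X \rightarrow Y\) model, step 4 can be sketched by hand (a sketch of the logic, not the CausalQueries internals; \(\Pr(X=1)\) assumed to be 0.5 and the cell counts hypothetical):

```r
# event probabilities implied by type shares lambda = (Y00, Y10, Y01, Y11),
# with cells ordered (X0Y0, X0Y1, X1Y0, X1Y1) and X randomized
event_probs <- function(lambda, pi = 0.5) {
  c(
    X0Y0 = (1 - pi) * (lambda[["Y00"]] + lambda[["Y01"]]),  # types with Y(0) = 0
    X0Y1 = (1 - pi) * (lambda[["Y10"]] + lambda[["Y11"]]),  # types with Y(0) = 1
    X1Y0 = pi * (lambda[["Y00"]] + lambda[["Y10"]]),        # types with Y(1) = 0
    X1Y1 = pi * (lambda[["Y01"]] + lambda[["Y11"]])         # types with Y(1) = 1
  )
}

# Pr(Data | lambda) for one candidate lambda and hypothetical counts
lambda <- c(Y00 = 0.3, Y10 = 0.1, Y01 = 0.4, Y11 = 0.2)
counts <- c(300, 200, 150, 350)
lik <- dmultinom(counts, prob = event_probs(lambda))
```

Repeating this for many candidate values of \(\lambda\) (or letting stan do it) yields \(\Pr(\lambda \mid \text{Data})\) up to normalization.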
5.8 Generalization: Procedure
Also possible when there is unobserved confounding
…where dotted lines mean that the response types for the two nodes are not independent
5.9 Illustration: “Lipids” data
Example of an IV model. What are the principal strata (response types)? What relations of conditional independence are implied by the model?
data("lipids_data")

lipids_data |>
  kable()
event    strategy   count
Z0X0Y0   ZXY        158
Z1X0Y0   ZXY        52
Z0X1Y0   ZXY        0
Z1X1Y0   ZXY        23
Z0X0Y1   ZXY        14
Z1X0Y1   ZXY        12
Z0X1Y1   ZXY        0
Z1X1Y1   ZXY        78
Note that in compact form we simply record the number of units (“count”) that display each possible pattern of outcomes on the three variables (“event”).[^1]
5.10 Model
model <- make_model("Z -> X -> Y; X <-> Y")

model |> plot()
5.11 Updating and querying
Queries can be conditioned on observable or counterfactual quantities